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Abstract 

We propose a general algorithmic framework for constrained matrix and tensor factorization, which 
is widely used in signal processing and machine learning. The new framework is a hybrid between 
alternating optimization (AO) and the alternating direction method of multipliers (ADMM): each matrix 
factor is updated in turn, using ADMM, hence the name AO-ADMM. This combination can naturally 
accommodate a great variety of constraints on the factor matrices, and almost all possible loss measures 
for the fitting. Computation caching and warm start strategies are used to ensure that each update is 
evaluated efficiently, while the outer AO framework exploits recent developments in block coordinate 
descent (BCD)-type methods which help ensure that every limit point is a stationary point, as well as 
faster and more robust convergence in practice. Three special cases are studied in detail: non-negative 
matrix/tensor factorization, constrained matrix/tensor completion, and dictionary learning. Extensive 
simulations and experiments with real data are used to showcase the effectiveness and broad applicability 
of the proposed framework. 

Keywords: constrained matrix/tensor factorization, non-negative matrix/tensor factorization, canonical 
polyadic decomposition, PARAFAC, matrix/tensor completion, dictionary learning, alternating optimiza¬ 
tion, alternating direction method of multipliers. 


1 Introduction 

Constrained matrix and tensor factorization techniques are widely used for latent parameter estimation and 
blind source separation in signal processing, dimensionality reduction and clustering in machine learning, 
and numerous other applications in diverse disciplines, such as chemistry and psychology. Least-squares low- 
rank factorization of matrices and tensors without additional constraints is relatively well-studied, as in the 
matrix case the basis of any solution is simply the principal components of the singular value decomposition 
(SVD) [19], also known as principal component analysis (PCA), and in the tensor case alternating least 
squares (ALS) and other algorithms usually yield satisfactory results [54]. ALS is also used for matrix 
factorization, especially when the size is so large that performing the exact PCA is too expensive. 

Whereas unconstrained matrix and tensor factorization algorithms are relatively mature, their constrained 
counterparts leave much to be desired as of this writing, and a unified framework that can easily and 
naturally incorporate multiple constraints on the latent factors is sorely missing. Existing algorithms are 
usually only able to handle one or at most few specialized constraints, and/or the algorithm needs to be 
redesigned carefully if new constraints are added. Commonly adopted constraints imposed on the latent 
factors include non-negativity [36], sparsity (usually via sparsity-inducing li regularization [41]), and simplex 
constraints [27], to name just a few. 

On top of the need to incorporate constraints on the latent factors, many established and emerging 
signal processing applications entail cost (loss) functions that differ from classical least-squares. Important 
examples include matrix completion [12] where missing values are ignored by the loss function, and robust 
PCA [11] where the h loss is used. In the matrix case without constraints on the latent factors, these can 
be formulated as convex problems and solved in polynomial-time. With explicit constraints imposed on the 
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latent factors, and/or for tensor data, however, non-convex (multi-linear) formulations are unavoidable, and 
a unified algorithmic framework that can handle a variety of constraints and loss functions would be very 
welcome. 

In this paper, we propose a general algorithmic framework that seamlessly and relatively effortlessly 
incorporates many common types of constraints and loss functions, building upon and bridging together 
the alternating optimization (AO) framework and the alternating direction method of multipliers (ADMM), 
hence the name AO-ADMM. 

While combining these frameworks may seem conceptually straightforward at first sight, what is signif¬ 
icant is that AO-ADMM outperforms all prior algorithms for constrained matrix and tensor factorization 
under nonparametric constraints on the latent factors. One example is non-negative matrix factorization, 
where the prior art includes decades of research. This is the biggest but not the only advantage of AO- 
ADMM. Carefully developing various aspects of this combination, we show that 

• AO-ADMM converges to a stationary point of the original NP-hard problem; 

• Using computation caching, warm-start, and good parameter settings, its per-iteration complexity is 
similar to that of ALS; 

• AO-ADMM can incorporate a wide-range of constraints and regularization penalties on the latent 
factors at essentially the same complexity; 

• It can also accommodate a wide variety of cost / loss functions, with only moderate increase in com¬ 
plexity relative to the classical least-squares loss; and 

• The core computations are exactly the same as ALS for unconstrained factorization, with some addi¬ 
tional element-wise operations to handle constraints, making it easy to incorporate smart implemen¬ 
tations of ALS, including sparse, parallel, and high-performance computing enhancements. 

1.1 Notation 

We denote the (approximate) factorization of a matrix Y « WH^, where Y is m x n, W is mx k, and H is 
n X k, with k < m,n, and in most cases much smaller. Note that adding constraints on W and H may turn 
the solution from easy to find (via SVD) but non-identifiable, to NP-hard to find but identifiable. It has 
been shown that simple constraints like non-negativity and sparsity can make the factors essentially unique, 
but at the same time, computing the optimal solution becomes NP-hard—see [29] and references therein. 

An Y-way array of dimension ni x n 2 x ... x un, with Y > 3, is denoted with an underscore, e.g., 
Y . In what follows, we focus on the so-called parallel factor analysis (PARAFAC) model, also known as 
canonical decomposition (CANDECOMP) or canonical polyadic decomposition (CPD), which is essentially 
unique under mild conditions [48], but constraints certainly help enhance estimation performance, and even 
identifiability. The factorization is denoted as Y « which is a concise way of representing the model 

k N 

Y(*i,..., iat ) w EHh d{id,j), 'iii,...,iN- 
j=l d=l 

Each matrix Hd is of size Ud x k, corresponding to the factor of the d-th mode. 

1.2 Multi-linear Algebra Basics 

With the increasing interest in tensor data processing, there exist many tutorials on this topic, for example, 
[34,49]. Here we briefly review some basic multi-linear operations that will be useful for the purposes of 
this paper, and refer the readers to those tutorials and the references therein for a more comprehensive 
introduction. 

The mode-d matricization, also known as mode-d matrix unfolding, of Y, denoted as Y(d), is a matrix 
of size n^i j^d ^ E^ch row of Y(d) is a vector obtained by fixing all the indices of Y except the d-th 
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one, and the matrix is formed by stacking these row vectors by traversing the rest of the indices from N 
back to 1. As an example, for iV = 3, the three matricizations are 


Y(i)= 


A(2) = 

■Y(1, 

.Y(3)= 

■Y(1,:,:)' 


Y(:,n2,:)y 


Y(ni,Oy 


Y(ni,:,:)_ 


Notice that, though essentially in the same spirit, this definition of mode-d matricization may be different 
from other expressions that have appeared in the literature, but we adopt this one for ease of our use. 

The Khatri-Rao product of matrices A and B having the same number of columns, denoted as A©B, 
is defined as the column-wise Kronecker product of A and B. More explicitly, if A is of size n x k, then 


A©B = [A(:,1)©B(:,1) 
'A(1,1)B(:,1) 

A(n,l)B(:,l) 


A(:,fc)©B(:,fc)] 
A(l,fc)B(:,fc)' 

A(n,fc)B(:,fc)_ 


The Khatri-Rao product is associative (although not commutative). We therefore generalize the operator © 
to accept more than two arguments in the following way 

N 

© H, = Hi © • • • © Hd_i © Hd+i © • • • © Hn- 
j=i 

With the help of this notation, if Y admits an exact PARAFAC model Y = then it can be expressed 

in matricized form as 



Lastly, a nice property of the Khatri-Rao product is that 

(A © B)^ (A © B) = A'^A ® B^B, 

where @ denotes the element-wise (Hadamard) matrix product. More generally, it holds that 



N \ N rp 

© H, = ® H[H,. 

j=l d I 3 

j^d / j^d 


2 Alternating Optimization Framework: Preliminaries 

We start by formulating the factorization problem as an optimization problem in the most general form 

N 

/(Y-[H,]^^i)+^r,(H,), (1) 


minimize 


d=l 


with a slight abuse of notation by assuming N can also take the value of 2. In (1), /(•) can be any loss measure, 
most likely separable down to the entries of the argument, and r£i(H£i) is the generalized regularization on 
Hd, which may take the value of -foo so that any hard constraints can also be incorporated. For example, 
if we require that the elements of are nonnegative, denoted as Hd > 0, then 


rd(Hd) = 


0, Hd > 0, 

- 1 - 00 , otherwise. 
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Because of the multi-linear term the regularized fitting problem is non-convex, and in many 

cases NP-hard [26,58]. A common way to handle this is to use the alternating optimization (AO) technique, 
i.e., update each factor in a cyclic fashion. The popular ALS algorithm is a special case of this when l{-) is 
the least-squares loss, and there is no regularization. In this section, we will first revisit the ALS algorithm, 
with the focus on the per-iteration complexity analysis. Then, we will briefly discuss the convergence of 
the AO framework, especially some recent advances on the convergence of the traditional block coordinate 
descent (BCD) algorithm. 


2.1 Alternating Least-Squares Revisited 

Consider the unconstrained matrix factorization problem 

minimize -||Y —WH^jjp, (2) 

WH 2" ^ ^ 


and momentarily ignore the fact that the optimal solution of (2) is given by the SVD. The problem (2) is 
non-convex in W and H jointly, but is convex if we fix one and only treat the other as variable. Supposing 
W is fixed, the sub-problem for H becomes the classical linear least squares and, if W has full column rank, 
the unique solution is given by 

= (W^W)-^W^Y. (3) 

In practice, the matrix inverse (W^W)-i is almost never explicitly calculated. Instead, the Cholesky 
decomposition of the Gram matrix W^W is computed, and for each column of W^Y, a forward and a 
backward substitution are performed to get the corresponding column of H^. Since W is m x k and Y is 
m X n, forming W^W and W^Y takes 0{mk^) and 0{mnk) flops, respectively, computing the Cholesky 
decomposition requires 0{k^) flops, and finally the back substitution step takes 0{nk‘^) flops, similar to a 
matrix multiplication. If TO,n > k, then the overall complexity is 0{mnk). 

An important implication is the following. Clearly, if n = I, then the cost of a least squares is 0{mk^). 
However, as n grows, the complexity does not simply grow proportionally with n, but rather goes to 0{mnk). 
The reason is that, although it seems we are now trying to solve n least-squares problems, they all share the 
same matrix W, thus the Cholesky decomposition of W^W can be reused throughout. This is a very nice 
property of the unconstrained least squares problems, which can be exploited to improve the computational 
efficiency of the ALS algorithm. 

One may recall that another well-adopted method for least-squares is to compute the QR decomposition 
of W as W = QR, so that = R-^Q^Y. This can be shown to give the same computational complexity 
as the Cholesky version, and is actually more stable numerically. However, if W has some special structure, 
it is easier to exploit this structure if we use Cholesky decomposition. Therefore, in this paper we only 
consider solving least-squares problems using the Cholesky decomposition. 

One important structure that we encounter is in the tensor case. For the ALS algorithm for PARAFAC, 
the update of is the solution of the following least squares problem 


and the solution is given by 


minimize 



( . \ 


Y(d)- 1 

, .0 H, 

1 

1 d=i ] 



VjAd / 



Hf 


J^d ) 



T 


Y(.). 


As we can see, the Gram matrix is computed efficiently by exploiting the structure, and its Cholesky 
decomposition can be reused. The most expensive operation is actually the computation of (©j^d Hj)^Y(£;), 
but very efficient algorithms for this (that work without explicitly forming the Khatri-Rao product and the 
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d-mode matricization) are available [4,16,30,43,50]. If we were to adopt the QR decomposition approach, 
however, none of these methods could be applied. 

In summary, least squares is a very mature technique with many favorable properties that render the ALS 
algorithm very efficient. On the other hand, most of the algorithms that deal with problems with constraints 
on the factors or different loss measures do not inherit these good properties. The goal of this paper is 
to propose an AO-based algorithmic framework, which can easily handle many types of constraints on the 
latent factors and many loss functions, with per-iteration complexity essentially the same as the complexity 
of an ALS step. 

2.2 The Convergence of AO 

Consider the following (usually non-convex) optimization problem with variables separated into N blocks, 
each with its own constraint set 


minimize /(xi,..., xjv) 

X1,...,XJV 

subject to Kd G Xd, Vd = I, 


(4) 


A classical AO method called block coordinate descent (BCD) cyclically updates x^ via solving 

minimize ,..., x^I^, 5, x^+i,..., Xjy) 

subject to ^ G Xd, 


(5) 


at the (r -I- l)-st iteration [6, §2.7]. Obviously, this will decrease the objective function monotonically. If 
some additional assumptions are satisfied, then we can have stronger convergence claims [6, Proposition 
2.7.1]. Simply put, if the sub-problem (5) is convex and has a unique solution, then every limit point is a 
stationary point; furthermore, if Ai,..., are all compact, which implies that the sequence generated by 
BCD is bounded, then BCD is guaranteed to converge to a stationary point, even if (4) is non-convex [56]. 

In many cases (5) is convex, but the uniqueness of the solution is very hard to guarantee. A special 
case that does not require uniqueness, first noticed by Grippo and Sciandrone [25], is when N = 2. On 
hindsight, this can be explained by the fact that for N = 2, BCD coincides with the so-called maximum 
block improvement (MBI) algorithm [14], which converges under very mild conditions. However, instead of 
updating the blocks cyclically, MBI only updates the one block that decreases the objective the most, thus 
the per-iteration complexity is {N — 1) times higher than BCD; therefore MBI is not commonly used in 
practice when N is large. 

Another way to ensure convergence, proposed by Razaviyayn et al. [44], is as follows. Instead of updating 
Xd as the solution of (5), the update is obtained by solving a majorized version of (5), called the block 
successive upper-bound minimization (BSUM). The convergence of BSUM is essentially the same, but now 
we can deliberately design the majorizing function to ensure that the solution is unique. One simple way to 
do this is to put a proximal regularization term 


minimize ..., x^j^, ^, x]]^!, ...,x)v) + |llC-xSf 

subject to ^ G Xd, 


( 6 ) 


for some fx > 0, where x]] is the update of x^ from the previous iteration. If (5) is convex, then (6) is strongly 
convex, which gives a unique minimizer. Thus, the algorithm is guaranteed to converge to a stationary point, 
as long as the sequence generated by the algorithm is bounded. Similar results are also proved in [59], where 
the authors used a different majorization for constrained matrix/tensor factorization; we shall compare with 
them in numerical experiments. 
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3 Solving the Sub-problems Using ADMM 

The AO algorithm framework is usually adopted when each of the sub-problems can be solved efficiently. This 
is indeed the case for the ALS algorithm, since each update is in closed-form. For the general factorization 
problem (1), we denote the sub-problem as 

minimize l(Y - WH^)-f r(H). (7) 

H 

For the matrix case, this is simply the sub-problem for the right factor, and one can easily figure out the 
update of the left factor by transposing everything; for the tensor case, this becomes the update of by 
setting Y = y{d) and W = Qj^d Hj ■ This is for ease of notation, as these matricizations and Khatri-Rao 
products need not be actually computed explicitly. Also notice that this is the sub-problem for the BCD 
algorithm, and for better convergence we may want to add a proximal regularization term to (7), which is 
very easy to handle, thus omitted here. 

We propose to use the alternating direction method of multipliers (ADMM) to solve (7). ADMM, if used 
in the right way, inherits a lot of the good properties that appeared in each update of the ALS method. 
Furthermore, the AO framework naturally provides good initializations for ADMM, which further accelerates 
its convergence for the subproblem. As a preview, the implicit message here is that closed-form solution is 
not necessary for eomputational efficiency, as we will explain later. After a brief introduction of ADMM, we 
first apply it to (7) which has least-squares loss, and then generalize it to universal loss measures. 

3.1 Alternating Direction Method of Multipliers 

ADMM solves convex optimization problems that can be written in the form 

minimize /(x) -|- g{z) 

X,Z 

subject to Ax -f Bz = c. 


by iterating the following updates 

X ^ argmin/(x) -|- (p/2)||Ax -f Bz - c -|- uH^, 

X 

z argming(z) -|- {p/2)\\Ax -|- Bz — c -f u||2, 

Z 

u u -f (Ax -f Bz — c), 

where u is a scaled version of the dual variables corresponding to the equality constraint Ax -f Bz = c, and 
p is specified by the user. 

A comprehensive review of the ADMM algorithm can be found in [7] and the references therein. The 
beauty of ADMM is that it converges under mild conditions (in the convex case), while artful splitting of the 
variables into the two blocks x and z can yield very efficient updates, and/or distributed implementation. 
Furthermore, if / is strongly convex and Lipschitz continuous, then linear convergence of ADMM can be 
achieved; cf. guidelines on the optimal step-size p in [46, § 9.3], and [22] for an analysis of ADMM applied 
to quadratic programming. 

3.2 Least-Squares Loss 

We start by considering l{-) in (7) to be the least-squares loss (1/2)]] • 11|.. The problem is reformulated by 
introducing a fc x n auxiliary variable H 

minirnize ijJY-WHll^ + r(H) 

H.H 2 (8) 

subject to H = H^. 
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It is easy to adopt the ADMM algorithm and derive the following iterates: 

H ^ (W^W + pI)-i(W^Y + p(H + U)^), 

H^argminr(H) +J||H-H^ + U||^, (9) 

H Z 

One important observation is that, throughout the iterations the same matrix W^Y and matrix inversion 
(W^W + pl)“^ are used. Therefore, to save computations, we can cache W^Y and the Cholesky decom¬ 
position of W^W -I- pi = LL^. Then the update of H is dominated by one forward substitution and one 
backward substitution, resulting in a complexity of 0{k‘^n). 

The update of H is the so-called proximity operator of the function (l/p)r(-) around point (H^ — U), 
and in particular if r(-) is the indicator function of a convex set, then the update of H becomes a projection 
operator, a special case of the proximity operator. For a lot of regularizations/constraints, especially those 
that are often used in matrix/tensor factorization problems, the update of H boils down to element-wise 
updates, i.e., costing 0(kn) flops. Here we list some of the most commonly used constraints/regularizations 
in the matrix factorization problem, and refer the reader to [42, §6]. For simplicity of notation, let us define 
H = - U. 

• Non-negativity. In this case r(-) is the indicator function of M+, and the update is simply zeroing out 
the negative values of H. In fact, any element-wise bound constraints can be handled similarly, since 
element-wise projection is trivial. 

• Lasso regularization. For r(H) = A||H||i, the sparsity inducing regularization, the update is the well- 
known soft-thresholding operator: hij = [1 — {X/p)\hij\~^]+hij. The element-wise thresholding can 
also be converted to block-wise thresholding if one wants to impose structured sparsity, leading to the 
group Lasso regularization. 

• Simplex constraint. In some probabilistic model analysis we need to constrain the columns or rows to 
be element-wise non-negative and sum up to one. As described in [18], this projection can be done 
with a randomized algorithm with linear-time complexity on average. 

• Smoothness regularization. We can encourage the columns of H to be smooth by adding the regular¬ 
ization r(H) = (A/2)||TH||| where T is an n x n tri-diagonal matrix with 2 on the diagonal and — 1 
on the super- and sub-diagonal. Its proximity operator is given by H = p(AT^T -|- /9l)“^H. Although 
it involves a large matrix inversion, notice that it has a fixed bandwidth of 2, thus can be efficiently 
calculated in 0(kn) time [24, §4.3]. 

We found empirically that by setting p= ||W|||./fc, the ADMM iterates for the regularized least-squares 
problem (8) converge very fast. This choise of p can be seen as an approximation to the optimal p given 
in [46], but much cheaper to obtain. With a good initialization, naturally provided by the AO framework, 
the update of H usually does not take more than 5 or 10 ADMM iterations, and very soon reduces down to 
only 1 iteration. The proposed algorithm for the sub-problem (8) is summarized in Alg. 1. As we can see, 
the pre-calculation step takes 0{k'^m k^) flops to form the Cholesky decomposition, and 0{mnk) flops to 

form F. Notice that these are actually the only computations in Alg. 1 that involve W and Y, which implies 
that in the tensor case, all the tricks to compute W^W and W^Y can be applied here, and then we do not 
need to worry about them anymore. The computational load of each ADMM iteration is dominated by the 
H-update, with complexity 0{k‘^n). 

It is interesting to compare Alg. 1 with an update of the ALS algorithm, whose complexity is essentially 
the same as the pre-calculation step plus one iteration. For a small number of ADMM iterations, the 
complexity of Alg. I is of the same order as an ALS step. 

For declaring termination, we adopted the general termination criterion described in [7, §3.3.1]. After 
some calibration, we define the relative primal residual 

r=||H-H^|||,/||H|||,, (10) 
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Algorithm 1: Solve (8) using ADMM 


Input: Y, W, H, U, k 

1 Initialize H and U; 

2 G = W^W; 

3 p = trace(G)/A: ; 

4 Calculate L from the Cholesky decomposition of G + pi = LL^; 

5 F = W^Y ; 

6 repeat 

7 H^L-^L~i(F + p(H + U)^ ) using forward/backward substitution ; 

8 H^argminHr(H) + f||H-H^ + U||| ; 

9 U^U + H-H^; 

10 until r < e and s < e r and s defined in (10) and (11); 

11 return H and U. 


and the relative dual residual 

s=||H-Ho|||/||U|||, (11) 

where Ho is H from the previous ADMM iteration, and terminate Alg. 1 if both of them are smaller than 
some threshold. 

Furthermore, if the BSUM framework is adopted, we need to solve a proximal regularized version of (8), 
and that term can easily be absorbed into the update of H. 


3.3 General Loss 

Now let us derive an ADMM algorithm to solve the more general problem (7). For this case, we reformulate 
the problem by introducing two auxiliary variables H and Y 


minimize l(Y — Y) + r(H) 

H,H,Y 

subject to H = H'^, Y = WH. 


( 12 ) 


To apply ADMM, let H be the first block, and (Y, H) be the second block, and notice that in the second 
block update Y and H can in fact be updated independently. This yields the following iterates: 

H 4- (W'^W + pI)-^(W^(Y+V) + p(H + U)^) 

r H 4- argimn r(H) + ^||H - + Uf^, 

I Y4-argmin Z(Y - Y) + i||Y - WH + V|||, (13) 

1 Y 2 

JU^U + H-H^, 

1 V 4- V + Y - WH. 


where U is the scaled dual variable corresponding to the constraint H = H^, and V is the scaled dual 
variable corresponding to the equality constraint Y = WH. Notice that we set the penalty parameter p 
corresponding to the second constraint to be 1, since it works very well in practice, and also leads to very 
intuitive results for some loss functions. This can also be interpreted as first pre-conditioning this constraint 
to be = -^WH, and then a common p is used. Again we set p = ||W|||./fc. 

As we can see, the update of H is simply a linear least squares problem, and all the previous discussion 
about caching the Cholesky decomposition applies. It is also easy to absorb an additional proximal regular¬ 
ization term into the update of H, if the BSUM framework is adopted. The update of Y is (similar to the 
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update of H) a proximity operator, and since almost all loss functions we use are element-wise, the update 
of Y is also very easy. The updates for some of the most commonly used non-least-squares loss functions 
are listed below. For simplicity, we define Y = WH — V, similar to the previous sub-section. 

• Missing values. In the case that only a subset of the entries in Y are available, a common way to 
handle this is to simply fit the low-rank model only to the available entries. Let A denote the index set 
of the available values in Y, then the loss function becomes /(Y — Y) = i — VijY■ Thus, 

the update of Y in (13) becomes 


I hivio+ytj), 

( yij, otherwise. 


• Robust fitting. In the case that data entries are not uniformly corrupted by noise but only sparingly 
corrupted by outliers, or when the noise is dense but heavy-tailed (e.g., Laplacian-distributed), we can 
use the h norm as the loss function for robust (resp. maximum-likelihood) fitting, i.e., /(Y — Y) = 
||Y — Y||i. This is similar to the li regularization, and the element-wise update is 

f Uijj IVij ~ Vijl — 1) 

yij = \ yij ~ 1) yij ~ yij ^ i) 

yij T 1, ijij ijij < 1. 


• Huber fitting. Another way to deal with possible outliers in Y is to use the Huber function to measure 
the loss 1{Y - Y) = <j)\{yij - yij) where 


<P\{z) 


2^ ’ 


A|z| - 


The element-wise closed-form update is 


k| < A, 
otherwise. 


r 2iyij~^yij)j 
yij — \ Vij ~ X, 

[ yij + 


\yij yij\ — ‘^x^ 
yij yij ^ ‘^x, 

yij yij ^ 


• Kullback-Leibler divergence. A commonly adopted loss function for non-negative integer data is the 
Kullback-Leibler (K-L) divergence defined as 


7^(Y||Y)=^ 


1 I ~ 

yij log ^3- yij + yij 

yij 


for which the proximity operator is 

Y = ^(Y - 1) + ^(Y-1)2 + 4y) , 

where all the operations are taken element-wise [53]. Furthermore, the K-L divergence is a special 
case of certain families of divergence functions, such as a-divergence and /3-divergence [17], whose 
corresponding npdates are also very easy to derive (boil down to the proximity operator of a scalar 
function). 

An interesting observation is that if the loss function is in fact the least-squares loss, the matrix (Y -|- V) 
that H is trying to fit in (13) is the data matrix Y per se. Therefore, the update rule (13) boils down to 
the update rule (9) in the least-squares loss case, with some redundant updates of Y and V. The detailed 
ADMM algorithm for (12) is summarized in Alg. 2. We use the same termination criterion as in Alg. 1. 
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Algorithm 2: Solve (12) using ADMM 


1 

2 

3 

4 

5 

6 

7 

8 
9 

10 

11 

12 


Input: Y, W, H, U, Y, V, k 

Initialize H, U, Y, and V; 

G = W^W; 

p = trace(G)/A: ; 

Calculate L from the Cholesky decomposition of G + pi = LL^; 
repeat 

H ^ L-^L-i(W'^(Y H- V) + p(H + U)^ ) using forward/backward substitution ; 
H^argminHr(H) + f||H-H^ + U||| ; 

Y^argminY ^(Y - Y) + i||Y - WH + V||2, ; 

U ^ U + H - ; 

V ^ V + Y- WH ; 

until r < e and s < e r and s defined in (10) and (11); 

return H, U, Y, and V. 


Everything seems to be in place to seamlessly move from the least-squares loss to arbitrary loss. Never¬ 
theless, closer scrutiny reveals that some compromises must be made to take this leap. One relatively minor 
downside is that with a general loss function we may lose the linear convergence rate of ADMM ~ albeit 
with the good initialization naturally provided by the AO framework and our particular choice of p, it still 
converges very fast in practice. The biggest drawback is that, by introducing the auxiliary variable Y and 
its dual variable V, the big matrix product W^(Y -f V) must be re-computed in each ADMM iteration, 
whereas in the previous case one only needs to compute W^Y once. This is the price we must pay; but it 
can be moderated by controlling the maximum number of ADMM iterations. 

Scalability considerations: As big data analytics become increasingly common, it is important to keep 
scalability issues in mind as we develop new analysis methodologies and algorithms. Big data Y is usually 
stored as a sparse array, i.e., a list of (index,value) pairs, with the unlisted entries regarded as zeros or 
missing. With the introduction of Y and V, both of size(Y), one hopes to be able to avoid dense operations. 
Fortunately, for some commonly used loss functions, this is possible. Notice that by defining Y = WH — V, 
the V-update essentially becomes 

Y-Y, 

which means a significant portion of entries in V are constants—0 if the entries are regarded as missing, ±1 
or ±A in the robust fitting or Huber fitting case if the entries are regarded as “corrupted”—thus they can be 
efficiently stored as a sparse array. As for Y, one can simply generate it “on-the-fly” using the closed-form 
we provided earlier (notice that Y has the memory-efficient “low-rank plus sparse” structure). The only 
occasion that Y is needed is when computing W^Y. 

4 Summary of the Proposed Algorithm 

We propose to use Alg. 1 or 2 as the core sub-routine for alternating optimization. The proposed “universal” 
multi-linear factorization algorithm is summarized as Alg. 3. A few remarks on implementing Alg. 3 are in 
order. 

Since each factor is updated in a cyclic fashion, one expects that after a certain number of cycles 
Hd (and its dual variable U^) obtained in the previous iteration will not be very far away from the update 
for the current iteration. In this sense, the outer AO framework naturally provides a good initial point to 
the inner ADMM iteration. With this warm-start strategy, the optimality gap for the sub-problem is then 
bounded by the per-step improvement of the AO algorithm, which is small. This mode of operation is crucial 
for insuring the efficiency of Alg. 3. Our experiments suggest that soon after an initial transient stage, the 
sub-problems can be solved in just one ADMM iteration (with reasonable precision). 
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Algorithm 3: Proposed algorithm for (1) 

1 Initialize 

2 Initialize Ui,...,Ujv to be all zero matrices; 

3 if least-squares loss then 

4 repeat 

5 for d = 1, N do 

6 Y = Snd W = Qj^d ; // not necessarily formed explicitly 

7 update Hd and Ud using Alg. 1 initialized with the previous and U^; 

8 end 

9 update Id if necessary ; // refer to (14) 

10 until convergence] 

11 else 

12 Initialize Y ^ Y. V ^ 0; 

13 repeat 

14 for d = 1, N do 

15 Y = ''^(d) Snd 'W = Qj^d Hy ; // not necessarily formed explicitly 

16 update Hd,U(i,Y(j;),V(c;) using Alg. 2 initialized with the previous Hd,U(i,Y(£;),V(d); 

17 end 

18 update Id if necessary ; // refer to (14) 

19 until convergence] 

20 end 


Similar ideas can be used for Y and V in the matrix case if we want to deal with non-least-squares loss, 
and actually only one copy of them is needed in the updates of both factors. A few different options are 
available in the tensor case. If memory is not an issue in terms of the size of Y, a convenient approach 
that is commonly adopted in ALS implementations is to store all N matricizations Y^^),..., Y(^), so they are 
readily available without need for repetitive data re-shuffling during run-time. If this practice is adopted, 
then it makes sense to also have N copies of Y and V, in order to save computation. Depending on the size 
and nature of the data and how it is stored, it may be completely unrealistic to keep multiple copies of the 
data and the auxiliary variables, at which point our earlier discussion on scalable implementation of Alg. 2 
for big but sparse data can be instrumental. 

Sometimes an additional proximal regularization is added to the sub-problems. The benefit is two-fold: 
it helps the convergence of the AO outer-loop when N > 3] while for the ADMM inner-loop it improves the 
conditioning of the sub-problem, which may accelerate the convergence of ADMM, especially in the general 
loss function case when we do not have strong convexity. The convergence of AO-ADMM is summarized in 
Proposition 1. 

Proposition 1. If the sequence generated by AO-ADMM in Alg. 3 is bounded, then for 

1. N = 2, 

2. N > 2,id> 0, 

AO-ADMM converges to a stationary point of (1). 

Proof. The first case with /i = 0 is covered in [14, Theorem 3.1], and the cases when qi > Q are covered 
in [44, Theorem 2]. □ 

Note that for N = 2, using ^ = 0 yields faster convergence than ^ > 0. For N > 2, i.e., for tensor data, 
we can update /i as follows 

+ (14) 
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which was proposed in [44] for unconstrained tensor factorization, and works very well in our context as well. 

The convergence result in Proposition 1 has an additional assumption that the sequence generated by the 
algorithm is bounded. For unconstrained PARAFAC, diverging components may be encountered during AO 
iterations [35,52], but adding Frobenious norm regularization for each matrix factor (with a small weight) 
ensures that the iterates remain bounded. 

As we can see, the ADMM is an appealing sub-routine for alternating optimization, leading to a simple 
plug-and-play generalization of the workhorse ALS algorithm. Theoretically, they share the same per- 
iteration complexity if the number of inner ADMM iterations is small, which is true in practice, after an 
initial transient. Efficient implementation of the overall algorithm should include data-structure-specific 
algorithms for W^Y or which dominate the per-iteration complexity, and may include 

parallel/distributed computation along the lines of [38]. 

Finally, if a non-least-squares loss is to be used, we suggest that the least-squares loss is first employed 
to get preliminary estimates (using Alg. 3 calling Alg. 1) which can then be fed as initialization to run 
Alg. 3 calling Alg. 2. The main disadvantage of Alg. 2 compared to Alg. 1 is that the big matrix (or tensor) 
multiplication W^CY + V) needs to be calculated in each ADMM iteration. Therefore, this strategy can 
save a significant amount of computations at the initial stage. 

5 Case Studies and Numerical Results 

In this section we will study some well-known constrained matrix/tensor factorization problems, derive the 
corresponding update for H in Alg. 1 or H and Y in Alg. 2, and compare it to some of the state-of-the- 
art algorithms for that problem. In all examples we denote our proposed algorithm as AO-ADMM. All 
experiments are performed in MATLAB 2015a on a Linux server with 32 Xeon 2.00GHz cores and 128GB 
memory. 

5.1 Non-negative Matrix and Tensor Factorization 

Perhaps the most common constraint imposed on the latent factors is non-negativity - which is often sup¬ 
ported by physical considerations (e.g., when the latent factors represent chemical concentrations, or power 
spectral densities) or other prior information, or simply because non-negativity sometimes yields interpretable 
factors [36]. Due to the popularity and wide range of applications of NMF, numerous algorithms have been 
proposed for fitting the NMF model, and most of them can be easily generalized to the tensor case. After 
a brief review of the existing algorithms for NMF, we compare our proposed algorithm to some of the best 
algorithms reported in the literature to showcase the efficiency of AO-ADMM. 

Let us start by considering NMF with least-squares loss, which is the prevailing loss function in practice. 
By adopting the alternating optimization framework, the sub-problem that emerges for each matrix factor 
is non-negative (linear) least-squares (NNLS). Some of the traditional methods for NNLS are reviewed 
in [15] (interestingly, not including ADMM), and most of them have been applied to NMF or non-negative 
PARAFAC, e.g., the active-set (AS) method [8,31] and block-principle-pivoting (BPP) [32,33]. Recall that 
in the context of the overall multi-linear factorization problem we actually need to solve a large number 
of (non-negative) least-squares problems sharing the same mixing matrix W, and in the unconstrained 
case this means we only need to calculate the Cholesky factorization of W^W once. Unfortunately, this 
good property that enables high efficiency implementation of ALS is not preserved by either AS or BPP. 
Sophisticated methods that group similar rows to reduce the number of inversions have been proposed [57], 
although as k grows larger this does not seem appealing in the worst case. Some other methods, like the 
multiplicative-update (MU) [37] or hierarchical alternating least squares (HALS) [17], ensure that the per- 
iteration complexity is dominated by calculating W^W and W^Y, although more outer-loops are needed 
for convergence. These are actually one step majorization-minimization or block coordinate descent applied 
to the NNLS problem. An accelerated version of MU and HALS is proposed in [23], which essentially does 
a few more inner-loops after computing the most expensive W^Y. 
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ADMM, on the other hand, may not be the fastest algorithm for a single NNLS problem, yet its overhead 
can be amortized when there are many NNLS problem instances sharing the same mixing matrix, especially 
if good initialization is readily available. This is in contrast to an earlier attempt to adopt ADMM to 
NMF [10], which did not use Cholesky caching, warm start, and a good choice of p to speed up the algorithm. 
Furthermore, ADMM can seamlessly incorporate different regularizations as well as non-least-squares loss. 

We should emphasize that AO forms the backbone of our proposed algorithm - ADMM is only applied 
to the sub-problems. There are also algorithms that directly apply an ADMM approach to the whole 
problem [38,53,60]. The per-iteration complexity of those algorithms is also the same as the unconstrained 
alternating least-squares. However, due to the non-convexity of the whole problem, the loss is not guaranteed 
to decrease monotonically, unlike alternating optimization. Moreover, both ADMM and AO guarantee that 
every limit point is a stationary point, but in practice AO almost always converges (as long as the updates 
stay bounded), which is not the case for ADMM applied to the whole problem. 

In another recent line of work [59], a similar idea of using an improved AO framework to ensure conver¬ 
gence is used. When [59] is specialized to non-negative matrix/tensor factorization, each update becomes a 
simple proximal-gradient step with an extrapolation. The resulting algorithm is also guaranteed to converge 
(likewise assuming that the iterates remain bounded), but it turns out to be slower than our algorithm, as 
we will show in our experiments. 

To apply our proposed algorithm to NMF or non-negative PARAFAC with least-squares loss, Alg. 1 is 
used to solve the sub-problems, with line 8 customized as 


H ^ 


-U 


i.e., zeroing out the negative values of (H^ — U). The tolerance for the ADMM inner-loop is set to 0.01. 


5.1.1 Non-negative matrix factorization 

We compare AO-ADMM with the following algorithms: 

AO-BPP. AO using block principle pivoting [32]^; 
accHALS. Accelerated HALS [23]^; 

APG. Alternating proximal gradient [59]^; 

ADMM. ADMM applied to the whole problem [60] 

AO-BPP and HALS are reported in [32] to outperform other methods, accHALS is proposed in [23] to 
improve HALS, APG is reported in [59] to outperform AO-BPP, and we include ADMM applied to the 
whole problem to compare the convergence behavior of AO and ADMM for this non-convex factorization 
problem. 

The aforementioned NMF algorithms are tested on two datasets. One is a dense image data set, the 
Extended Yale Face Database B®, of size 32256 x 1932, where each column is a vectorized 168 x 192 image 
of a face, and the dataset is a collection of face images of 29 subjects under various poses and illumination 
conditions. The other one is the Topic Detection and Tracking 2 (TDT2) text corpus®, of size 10212 x 36771, 
which is a sparse document-term matrix where each entry counts the frequency of a term in one document. 

The convergence of the relative error || Y — WH^||f/|| versus time in seconds for the Extended Yale 
B dataset is shown in Fig. 1, with k = 100 on the left and k = 300 on the right; and for the TDT2 dataset 
in Fig. 2, with k = 500 on the left and k = 800 on the right. The ADMM algorithm [60] is not included for 
TDT2 because the code provided online is geared towards imputation of matrices with missing values - it 
does not treat a sparse input matrix as the full data, unless we fill-in all zeros. 

^http: //www. cc .gatech. edu/-hpark/iimf software .php 

^https://sites.google.com/site/nicolasgillls/code 

^http://www.math.ucla.edu/-wotaoyin/papers/bcu/matlab.html 

"^http: //mcnf. blogs. rice. edu/ 

®http://vision.ucsd.edu/-leekc/ExtYaleDatabase/ExtYaleB.html 
®http://www.cad.zju.edu.cn/home/dengcai/Data/TextData.html 
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Figure 1: Convergence of some NMF algorithms on the Extended Yale B dataset. 
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Figure 2: Convergence of some NMF algorithms on the TDT2 dataset. 
Table 1: Averaged performance of NMF algorithms on synthetic data. 


Algorithm 

||Y-WH^IIf 

run time 

iterations 

AO-ADMM 

193.1026 

21.7s 

86.9 

AO-BPP 

193.1516 

40.9s 

52.2 

accHALS 

193.1389 

26.8s 

187.0 

APG 

193.1431 

25.3s 

240.2 

ADMM 

193.6808 

31.9s 

125.2 


We also tested these algorithms on synthetic data. For m = n = 2000 and k = 100, the true W and H 
are generated by drawing their elements from an i.i.d. exponential distribution with mean 1, and then 50% 
of the elements are randomly set to 0. The data matrix Y is then set to be Y = WH^ + N, where the 
elements of N are drawn from an i.i.d. Gaussian distribution with variance 0.01. The averaged results of 
100 Monte-Carlo trials are shown in Table 1. As we can see, AO-based methods are able to attain smaller 
fitting errors than directly applying ADMM to this non-convex problem, while AO-ADMM provides the 
most efficient per-iteration complexity. 
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Figure 3: Convergence of some non-negative PARAFAC algorithms on the CT dataset. 


Table 2: Averaged performance of non-negative PARAFAC algorithms on synthetic data 


Algorithm 

||Y- [Hi, H2, Halil 

run time 

iterations 

AO-ADMM 

1117.597 

145.2s 

25.1 

AO-BPP 

1117.728 

679.0s 

22.6 

HALS 

1117.655 

1838.7s 

137.7 

APG 

1117.649 

1077.4s 

156.3 

ADMM 

1156.799 

435.9s 

77.2 

Tensorlab 

1118.427 

375.8s 

N/A 


5.1.2 Non-negative PARAFAC 

Similar algorithms are compared in the non-negative PARAFAC case: 

AO-BPP. AO using block principle pivoting [33]^; 

HALS. Hierarchical alternating least-squares [17]^; 

APG. Alternating proximal gradient [59]^; 

ADMM. ADMM applied to the whole problem [38]. 

For our proposed AO-ADMM algorithm, a diminishing proximal regularization term in the form (6) is added 
to each sub-problem to enhance the overall convergence, with the regularization parameter /i updated as 
(14). 

Two real datasets are being tested: one is a dense CT image dataset^ of size 260 x 190 x 150, which 
is a collection of 150 CT images of a female’s ankle, each with size 260 x 190; the other one is a sparse 
social network dataset - Facebook Wall Posts®, of size 46952 x 46951 x 1592, that collects the number of 
wall posts from one Facebook user to another over a period of 1592 days. The sparse tensor is stored in 
the sptensor format supported by the tensor_toolbox [5], and all the aforementioned algorithms use this 
toolbox to handle sparse tensor data. 

Similar to the matrix case, the normalized root mean squared error versus time in seconds for the CT 
dataset is shown in Fig. 3, with fc = 10 on the left and fc = 30 on the right, and that for the Facebook Wall 
Posts data is shown in Fig. 4, with k = 30 on the left and k = 100 on the right. As we can see, AO-ADMM 

^http://www.nlm.nih.gov/research/vislble/ 

®http://konect.uni-koblenz.de/networks/facebook-wosn-wall 
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Figure 4: Convergence of some non-negative PARAFAC algorithms on the Facebook Wall Posts dataset. 


again converges the fastest, not only because of the efficient per-iteration update from Alg. 1, but also thanks 
to the additional proximal regularization to help the algorithm avoid swamps, which are not uncommon in 
alternating optimization-based algorithms for tensor decomposition. 

Monte-Carlo simulations were also conducted using synthetic data for 3-way non-negative tensors with 
ni = n 2 = ns = 500 and k = 100, with the latent factors generated in the same manner as for the previous 
NMF synthetic data, and the tensor data generated as the low-rank model synthesized from those factors 
plus i.i.d. Gaussian noise with variance 0.01. The averaged result over 100 trials is given in Table 2. For 
this experiment we have also included Tensorlab [51], which handles non-negative PARAFAC using “all-at- 
once” updates based on the Gauss-Newton method. As we can see, AO-ADMM again outperforms all other 
algorithms in all cases considered. 

5.2 Constrained Matrix and Tensor Completion 

As discussed before, real-world data are often stored as a sparse array, i.e., in the form of (index,value) 
pairs. Depending on the application, the unlisted entries in the array can be treated as zeros, or as not (yet) 
observed but possibly nonzero. A well-known example of the latter case is the Netflix prize problem, which 
involves an array of movie ratings indexed by customer and movie. The data is extremely sparse, but the 
fact that a customer did not rate a movie does not mean that the customer’s rating of that movie would be 
zero - and the goal is actually to predict those unseen ratings to provide good movie recommendations. 

For matrix data with no constraints on the latent factors, convex relaxation techniques that involve the 
matrix nuclear norm have been proposed with provable matrix reconstruction bounds [12]. Some attempts 
have been made to generalize the matrix nuclear norm to tensor data [21,39], but that boils down to the 
Tucker model rather than the PARAFAG model that we consider here. A key difference is that Tucker 
modeling can only hope to impute (recover missing values) in the data, whereas PARAFAG can uniquely 
recover the latent factors - the important ‘dimensions’ of consumer preference in this context. Another key 
difference is that the aforementioned convex relaxation techniques cannot incorporate constraints on the 
latent factors, which can improve the estimation performance. Taking the Netflix problem as an example, 
user-bias and movie-bias terms are often successfully employed in recommender systems; these can be easily 
subsumed in the factorization formulation by constraining, say, the first column of W and the second column 
of H to be equal to the all-one vector. Moreover, interpreting each column of W (H) as the appeal of a 
certain movie genre to the different users (movie ratings for a given type of user, respectively), it is natural 
to constrain the entries of W and H to be non-negative. 

When matrix/tensor completion is formulated as a constrained factorization problem using a loss function 
as in Sec. 3.3, there are traditionally two ways to handle it. One is directly using alternating optimization, 
although due to the random positions of the missing values, the least-squares problem for each row of H will 
involve a different subset of the rows of W, thus making the update inefficient even in the unconstrained 
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Figure 5: Illustration of the missing values in the Amino acids fluorescence data. 
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Figure 6: The emission loadings (H 2 ) produced by the N-way toolbox on the left, which uses EM, and by 
AO-ADMM on the right. 


case. A more widely used way is an instance of expectation-maximization (EM): one starts by filling the 
missing values with zeros, and then iteratively fits a (constrained) low-rank model and imputes the originally 
missing values with predictions from the interim low-rank model. More recently, an ADMM approach that 
uses an auxiliary variable for the full data was proposed [60], although if we look carefully at that auxiliary 
variable, it is exactly equal to the filled-in data given by the EM method. 

In fact, the auxiliary variable Y that we introduce is similar to that of [60], thus also related to the way 
that EM imputes the missing values—one can treat our method as imputing the missing values per ADMM 
inner-loop, the method in [60] as imputing per iteration, and EM as imputing after several iterations. 
However, our proposed AO-ADMM is able to give better results than EM, despite the similarities. As an 
illustrative example, consider the Amino acids fluorescence data®, which is a 5 x 201 x 61 tensor known to 
be generated by a rank-3 non-negative PARAFAC model. However, some of the entries are known to be 
badly contaminated, and are thus deleted, as shown in Eig. 5. Imposing non-negativity on the latent factors, 
the emission loadings H 2 of the three chemical components provided by the EM method using the iV-way 
toolbox [3] and AO-ADMM are shown in Eig. 6. While both results are satisfactory, AO-ADMM is able to 
suppress the artifacts caused by the systematically missing values in the original data, as indicated by the 
arrows in Eig. 6. 

We now evaluate our proposed AO-ADMM on a movie rating dataset called MovieLens^®, which consists 
of 100,000 movie ratings from 943 users on 1682 movies. MovieLens includes 5 sets of 80%-20% splits 
of the ratings for training and testing, and for each split we fit a matrix factorization model based on 
the 80% training data, and evaluate the correctness of the model on the 20% testing data. The averaged 
performance on this 5-fold cross validation is shown in Eig. 7, where we used the mean absolute error (MAE) 

®http: //www. models .kvl .dk/Aniino_Acid_fluo 
: //grouplens. org/datasets/movielens/ 
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Figure 7: Training and testing mean absolute error (MAE) versus model rank of the MovieLens data, 
averaged over a 5-fold cross validation, comparing least-squares fitting (on the left) and Kullback-Leibler 
fitting (on the right), with Tikhonov regularization, non-negativity constraint, or non-negativity with biases 
on the latent factors. 


for comparison with the classical collaborative filtering result [47] (which attains a MAE of 0.73). On the 
left of Fig. 7, we used the traditional least-squares criterion to fit the available ratings, whereas on the right 
we used the Kullback-Leibler divergence for fitting, since it is a meaningful statistical model for integer data. 
For each fitting criterion, we compared the performance by imposing Tikhonov regularization (A/2)|| • |||. 
with A = 0.1, or non-negativity, or non-negativity with biases (i.e., in addition constraining the first column 
of W and second column of H to be all ones). Some observations are as follows: 

• Low-rank indeed seems to be a good model for this movie rating data, and the right rank seems to be 
4 or 5, higher rank leads to over-fitting, as evident from Fig. 7; 

• Imposing non-negativity reduces the over-fitting at higher ranks, whereas the fitting criterion does not 
seem to be playing a very important role in terms of performance; 

• By adding biases, the best case prediction MAE at rank 4 is less than 0.69, an approximately 6% 
improvement over the best result reported in [47]. 

Notice that our aim here is to showcase how AO-ADMM can be used to explore possible extentions to the 
matrix completion problem formulation, rather than come up with the best recommender system method, 
which would require significant exploration in its own right. We believe with the versability of AO-ADMM, 
researchers can easily test various models for matrix/tensor completion, and quickly narrow down the one 
that works the best for their specific application. 

5.3 Dictionary Learning 

Many natural signals can be represented as an (approximately) sparse linear combination of some (possibly 
over-complete) basis, for example the Fourier basis for speech signals and the wavelet basis for images. 
If the basis (or dictionary when over-complete) is known, one can directly do data compression via greedy 
algorithms or convex relaxations to obtain the sparse representation [9], or even design the sensing procedure 
to reduce the samples required for signal recovery [13]. If the dictionary is not known, then one can resort 
to the so called dictionary learning (DL) to try to learn a sparse representation [55], if one exists. The 
well-known benchmark algorithm for DL is called A:-SVD [2], which is a geometry-based algorithm, and can 
be viewed as a generalization of the clustering algorithms fc-means and A:-planes. However, as noted in the 
original paper, Ic-SVD does not scale well as the size of the dictionary increases. Thus fc-SVD is often used 
to construct a dictionary of small image patches of size 8x8, with a few hundreds of atoms. 
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DL can also be formulated as a matrix factorization problem 

minimize -IIY — DSII p + r(S) 

D,s 2 " ^ ’ (15) 

subject to D € T>, 

where r(-) is a sparsity inducing regularization, e.g., the cardinality, the li norm, or the log penalty; concep¬ 
tually there is no need for a constraint on D, however, due to the scaling ambiguity inherent in the matrix 
factorization problem, we need to impose some norm constraint on the scaling of D to make the problem 
better defined. For example, we can bound the norm of each atom in the dictionary, ||di|| < l,Vz = 1,..., k, 
where is the z-th column of D, and we adopt this constraint here. 

Although bounding the norm of the columns of D works well, it also complicates the update of D—without 
this constraint, each row of D is the solution of an independent least-squares problem sharing the same 
mixing matrix, while the constraint couples the columns of D, making the problem non-separable. Existing 
algorithms either solve it approximately [45] or by sub-optimal methods like cyclic column updates [40]. On 
the other hand, this is not a problem at all for our proposed ADMM sub-routine Alg. 1: the row separability 
of the cost function and the column separability of the constraints are handled separately by the two primal 
variable blocks, while our previously discussed Cholesky caching, warm starting, and good choice of p ensure 
that an exact dictionary update can be done very efficiently. 

The update of S, sometimes called the sparse coding step, is a relatively well-studied problem for which 
numerous algorithms have been proposed. We mainly focus on the h regularized formulation, in which case 
the sub-problem becomes the well-known LASSO, and in fact a large number of LASSOs sharing the same 
mixing matrix. Alg. 1 can be used by replacing the proximity step with the soft-thresholding operator. 
Furthermore, if an over-complete dictionary is trained, the least-squares step can also be accelerated by 
using the matrix inversion lemma: 

(D^D -h pl)-i = p-^1 - -f DD^)-iD. 

Thus, if m <C fc, one can cache the Cholesky of pi + DD^ = LL^ instead, and replace the least-squares step 
in Alg. 1 with 

S ^ - D^L-^L-^DB), 

where B = D^Y -|- p(S -|- U). The use of ADMM for LASSO is also discussed in [1,20,61], and [7], and we 
generally followed the one described in [7, §6]. Again, one should notice that compared to a plain LASSO, 
our LASSO sub-problem in the AO framework comes with a good initialization, therefore only a very small 
number of ADMM-iterations are required for convergence. 

It is interesting to observe that for the particular constraints and regularization used in DL, incorporating 
non-negativity maintains the simplicity of our proposed algorithm—for both the norm bound constraint and 
h regularization, the proximity operator in Alg. 1 with non-negativity constraint simply requires zeroing out 
the negative values before doing the same operations. In some applications non-negativity can greatly help 
the identification of the dictionary [28]. 

As an illustrative example, we trained a dictionary from the MNIST handwritten digits dataset^^, which is 
a collection of gray-scale images of handwritten digits of size 28 x 28, and for each digit we randomly sampled 
1000 images, forming a matrix of size 784 x 10, 000. Non-negativity constraints are imposed on both the 
dictionary and the sparse coefficients. For k = 100, and by setting the h penalty parameter A = 0.5, the 
trained dictionary after 100 AO-ADMM (outer-)iterations is shown in Fig. 8. On average approximately 11 
atoms are used to represent each image, and the whole model is able to describe approximately 60% of the 
energy of the original data, and the entire training time takes about 40 seconds. Most of the atoms in the 
dictionary remain readable, which shows the good interpretability afforded by the additional non-negativity 
constraint. 

For comparison, we tried the same data set with the same parameter settings with the popular and well- 
developed DL package SPAMS^^. For fair comparison, we used SPAMS in batch mode with batch size equal 

^^http://www.cs.nyu.edu/-roweis/data.html 

//spams-devel.gforge.inria.fr/lndex.html 
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Figure 8: Trained dictionary from the MNIST handwritten digits dataset. 

to the size of the training data, and run it for 100 iterations (same number of iterations as AO-ADMM). 
The quality of the SPAMS dictionary is almost the same as that of AO-ADMM, but it takes SPAMS about 
3 minutes to run through these 100 iterations, versus 40 seconds for AO-ADMM. The performance does not 
change much if we remove the non-negativity constraint when using SPAMS, although the resulting dictionary 
then loses interpretability. Notice that SPAMS is fully developed in C-I--I-, whereas our implementation is 
simply written in MATLAB, which leaves considerable room for speed improvement using a lower-level 
language compiler. 

6 Conclusion 

In this paper we proposed a novel AO-ADMM algorithmic framework for matrix and tensor factorization un¬ 
der a variety of constraints and loss functions. The main advantages of the proposed AO-ADMM framework 
are: 

• Efficiency. By carefully adopting AO as the optimization backbone and ADMM for the individual 
sub-problems, a significant part of the required computations can be effectively cached, leading to 
a per-iteration complexity similar to the workhorse ALS algorithm for unconstrained factorization. 
Warm-start that is naturally provided by AO together with judicious regularization and choice of 
parameters further reduce the number of inner ADMM and outer AO iterations. 

• Universality. Thanks to ADMM, which is a special case of the proximal algorithm, non-least-squares 
terms can be handled efficiently with element-wise complexity using the well-studied proximity op¬ 
erators. This includes almost all non-parametric constraints and regularization penalties commonly 
imposed on the factors, and even non-least-squares fitting criteria. 

• Convergence. AO guarantees monotone decrease of the loss function, which is a nice property for 
the NP-hard factorization problems considered. Moreover, recent advances on generalizations of the 
traditional BCD algorithms further guarantee convergence to a stationary point. 

Case studies on non-negative matrix/tensor factorization, constrained matrix/tensor completion, and 
dictionary learning, with extensive numerical experiments using real data, corroborate our main claims. We 
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believe that AO-ADMM can serve as a plug-and-play framework that allows easy exploration of different 
types of constraints and loss functions, as well as different types of matrix and tensor (co-)factorization 
models. 


References 

[1] M. V. Afonso, J. M. Bioucas-Dias, and M. A. T. Figueiredo, “Fast image recovery using variable splitting 
and constrained optimization,” IEEE Trans, on Image Processing, vol. 19, no. 9, pp. 2345-2356, 2010. 

[2] M. Aharon, M. Elad, and A. Bruckstein, “fc-SVD: An algorithm for designing overcomplete dictionaries 
for sparse representation,” IEEE Trans, on Signal Processing, vol. 54, no. 11, pp. 4311-4322, 2006. 

[3] C. A. Andersson and R. Bro, “The N-way toolbox for matlab,” Chemometrics and Intelligent Laboratory 
Systems, vol. 52, no. 1, pp. 1-4, 2000. 

[4] B. W. Bader and T. G. Kolda, “Efficient MATLAB computations with sparse and factored tensors,” 
SIAM Journal on Scientific Computing, vol. 30, no. 1, pp. 205-231, 2007. 

[5] B. W. Bader, T. G. Kolda et al., “Matlab tensor toolbox version 2.6,” http://www.sandia.gov/~tgkolda/ 
Tensor Toolbox/, February 2015. 

[6] D. P. Bertsekas, Nonlinear programming. Athena Scientific, 1999. 

[7] S. P. Boyd, N. Parikh, E. Ghu, B. Peleato, and J. Eckstein, “Distributed optimization and statistical 
learning via the alternating direction method of multipliers,” Foundations and Trends@ in Machine 
Learning, vol. 3, no. 1, pp. 1-122, 2011. 

[8] R. Bro and S. De Jong, “A fast non-negativity-constrained least squares algorithm,” Journal of Chemo¬ 
metrics, vol. 11, no. 5, pp. 393-401, 1997. 

[9] A. M. Bruckstein, D. L. Donoho, and M. Elad, “Erom sparse solutions of systems of equations to sparse 
modeling of signals and images,” SIAM review, vol. 51, no. 1, pp. 34-81, 2009. 

[10] X. Gai, Y. Chen, and D. Han, “Nonnegative tensor factorizations using an alternating direction method,” 
Frontiers of Mathematics in China, vol. 8, no. 1, pp. 3-18, 2013. 

[11] E. J. Candes, X. Li, Y. Ma, and J. Wright, “Robust principal component analysis?” Journal of the 
ACM, vol. 58, no. 3, p. 11, 2011. 

[12] E. J. Candes and B. Recht, “Exact matrix completion via convex optimization,” Foundations of Com¬ 
putational Mathematics, vol. 9, no. 6, pp. 717-772, 2009. 

[13] E. J. Candes and M. B. Wakin, “An introduction to compressive sampling,” IEEE Signal Processing 
Magazine, vol. 25, no. 2, pp. 21-30, 2008. 

[14] B. Chen, S. He, Z. Li, and S. Zhang, “Maximum block improvement and polynomial optimization,” 
SIAM Journal on Optimization, vol. 22, no. 1, pp. 87-107, 2012. 

[15] D. Chen and R. J. Plemmons, “Nonnegativity constraints in numerical analysis,” in Symposium on the 
Birth of Numerical Analysis, 2009, pp. 109-140. 

[16] J. H. Choi and S. V. N. Vishwanathan, “DFacTo: Distributed factorization of tensors,” in Advances in 
Neural Information Processing Systems, 2014, pp. 1296-1304. 

[17] A. Cichocki and A.-H. Phan, “Fast local algorithms for large scale nonnegative matrix and tensor fac¬ 
torizations,” lEICE Trans, on Fundamentals of Electronics, Communications and Computer Sciences, 
vol. 92, no. 3, pp. 708-721, 2009. 


21 



[18] J. Duchi, S. Shalev-Shwartz, Y. Singer, and T. Chandra, “Efficient projections onto the ?i-ball for 
learning in high dimensions,” in Proc. ACM ICML, 2008, pp. 272-279. 

[19] C. Eckart and G. Young, “The approximation of one matrix by another of lower rank,” Psvchometrika, 
vol. 1, no. 3, pp. 211-218, 1936. 

[20] E. Esser, Y. Lou, and J. Xin, “A method for finding structured sparse solutions to nonnegative least 
squares problems with applications,” SIAM Journal on Imaging Sciences, vol. 6, no. 4, pp. 2010-2046, 
2013. 

[21] S. Gandy, B. Recht, and 1. Yamada, “Tensor completion and low-n-rank tensor recovery via convex 
optimization,” Inverse Problems, vol. 27, no. 2, p. 025010, 2011. 

[22] E. Ghadimi, A. Teixeira, 1. Shames, and M. Johansson, “Optimal parameter selection for the alternat¬ 
ing direction method of multipliers (ADMM): quadratic problems,” IEEE Transactions on Automatic 
Control, vol. 60, no. 3, pp. 644-658, March 2015. 

[23] N. Gillis and E. Glineur, “Accelerated multiplicative updates and hierarchical als algorithms for non¬ 
negative matrix factorization,” Neural Computation, vol. 24, no. 4, pp. 1085-1105, 2012. 

[24] G. H. Golub and C. F. Van Loan, Matrix Computations. Johns Hopkins University Press, 1996, vol. 3. 

[25] L. Grippo and M. Sciandrone, “On the convergence of the block nonlinear Gauss-Seidel method under 
convex constraints,” Operations Research Letters, vol. 26, no. 3, pp. 127-136, 2000. 

[26] C. J. Hillar and L.-H. Lim, “Most tensor problems are NP-hard,” Journal of the ACM, vol. 60, no. 6, 
p. 45, 2013. 

[27] T. Hofmann, “Probabilistic latent semantic indexing,” in Proc. ACM SIGIR Conference, 1999, pp. 
50-57. 

[28] P. O. Hoyer, “Non-negative matrix factorization with sparseness constraints,” Journal of Machine Learn¬ 
ing Research, vol. 5, pp. 1457-1469, 2004. 

[29] K. Huang, N. D. Sidiropoulos, and A. Swami, “Non-negative matrix factorization revisited: Uniqueness 
and algorithm for symmetric decomposition,” IEEE Trans, on Signal Processing, vol. 62, no. 1, pp. 
211-224, Jan 2014. 

[30] U. Kang, E. E. Papalexakis, A. Harpale, and C. Faloutsos, “GigaTensor: scaling tensor analysis up by 
100 times-algorithms and discoveries,” in Proc. ACM SIGKDD, 2012, pp. 316-324. 

[31] H. Kim and H. Park, “Nonnegative matrix factorization based on alternating nonnegativity constrained 
least squares and active set method,” SIAM Journal on Matrix Analysis and Applications, vol. 30, no. 2, 
pp. 713-730, 2008. 

[32] J. Kim and H. Park, “Fast nonnegative matrix factorization: An active-set-like method and compar¬ 
isons,” SIAM Journal on Scientific Computing, vol. 33, no. 6, pp. 3261-3281, 2011. 

[33] -, “Fast nonnegative tensor factorization with an active-set-like method,” in High-Performance Sci¬ 

entific Computing. Springer, 2012, pp. 311-326. 

[34] T. G. Kolda and B. W. Bader, “Tensor decompositions and applications,” SIAM review, vol. 51, no. 3, 
pp. 455-500, 2009. 

[35] J. B. Kruskal, R. A. Harshman, and M. E. Lundy, “How 3-MFA data can cause degenerate PARAFAG 
solutions, among other relationships,” in Multiway Data Analysis, 1989, pp. 115-122. 


22 



[36] D. D. Lee and H. S. Seung, “Learning the parts of objects by non-negative matrix factorization,” Nature, 
vol. 401, no. 6755, pp. 788-791, 1999. 

[37] -, “Algorithms for non-negative matrix factorization,” in Advances in Neural Information Processing 

Systems (NIPS), 2001, pp. 556-562. 

[38] A. P. Liavas and N. D. Sidiropoulos, “Parallel algorithms for constrained tensor factorization via the 
alternating direction method of multipliers,” IEEE Trans, on Signal Processing, 2014, submitted. 

[39] J. Liu, P. Musialski, P. Wonka, and J. Ye, “Tensor completion for estimating missing values in visual 
data,” IEEE Trans, on Pattern Analysis and Machine Intelligence, vol. 35, no. 1, pp. 208-220, 2013. 

[40] J. Mairal, F. Bach, J. Ponce, and G. Sapiro, “Online learning for matrix factorization and sparse 
coding,” Journal of Machine Learning Research, vol. 11, pp. 19-60, 2010. 

[41] B. A. Olshausen and D. J. Field, “Sparse coding with an overcomplete basis set: A strategy employed 
by VI?” Vision Research, vol. 37, no. 23, pp. 3311-3325, 1997. 

[42] N. Parikh and S. P. Boyd, “Proximal algorithms,” Foundations and Trends® in Optimization, vol. 1, 
no. 3, pp. 123-231, 2014. 

[43] N. Ravindran, N. D. Sidiropoulos, S. Smith, and G. Karypis, “Memory-efficient parallel computation of 
tensor and matrix products for big tensor decomposition,” in Asilomar Conference on Signals, Systems, 
and Computers, 2014. 

[44] M. Razaviyayn, M. Hong, and Z.-Q. Luo, “A unified convergence analysis of block successive mini¬ 
mization methods for nonsmooth optimization,” SIAM Journal on Optimization, vol. 23, no. 2, pp. 
1126-1153, 2013. 

[45] M. Razaviyayn, H.-W. Tseng, and Z.-Q. Luo, “Dictionary learning for sparse representation: Complexity 
and algorithms,” in Proc. IEEE ICASSP, 2014, pp. 5247-5251. 

[46] E. Ryu and S. P. Boyd, “Primer on monotone operator methods,” Preprint, available at http://web. 
stanford.edu/~eryu/papers/monotone_notes.pdf. 

[47] B. Sarwar, G. Karypis, J. Konstan, and J. Riedl, “Item-based collaborative filtering recommendation 
algorithms,” in Proceedings of the 10th International Conference on World Wide Web, 2001, pp. 285- 
295. 

[48] N. D. Sidiropoulos and R. Bro, “On the uniqueness of multilinear decomposition of N-way arrays,” 
Journal of chemometrics, vol. 14, no. 3, pp. 229-239, 2000. 

[49] A. Smilde, R. Bro, and P. Geladi, Multi-way analysis: applications in the chemical sciences. John 
Wiley & Sons, 2005. 

[50] S. Smith, N. Ravindran, N. D. Sidiropoulos, and G. Karypis, “SPLATT: Efficient and parallel sparse 
tensor-matrix multiplication,” in IEEE International Parallel & Distributed Processing Symposium, 
2015. 

[51] L. Sorber, M. Van Barel, and L. De Lathauwer, “Tensorlab v2.0,” http://www.tensorlab.net/, Jan. 
2014. 

[52] A. Stegeman, “Finding the limit of diverging components in three-way candecomp/parafac-a 
demonstration of its practical merits,” Comput. Stat. Data Anal., vol. 75, pp. 203-216, Jul. 2014. 
[Online]. Available: http://dx.doi.Org/10.1016/j.csda.2014.02.010 

[53] D. L. Sun and C. Fevotte, “Alternating direction method of multipliers for non-negative matrix factor¬ 
ization with the beta-divergence,” in Proc. IEEE ICASSP, 2014, pp. 6201-6205. 


23 



[54] G. Tomasi and R. Bro, “A comparison of algorithms for fitting the PARAFAC model,” Computational 
Statistics & Data Analysis, vol. 50, no. 7, pp. 1700-1734, 2006. 

[55] I. Tosic and P. Frossard, “Dictionary learning,” IEEE Signal Processing Magazine, vol. 28, no. 2, pp. 
27-38, 2011. 

[56] P. Tseng, “Convergence of a block coordinate descent method for nondifferentiable minimization,” 
Journal of Optimization Theory and Applications, vol. 109, no. 3, pp. 475-494, 2001. 

[57] M. H. Van Benthem and M. R. Keenan, “Fast algorithm for the solution of large-scale non-negativity- 
constrained least squares problems,” Journal of Chemometrics, vol. 18, no. 10, pp. 441-450, 2004. 

[58] S. A. Vavasis, “On the complexity of nonnegative matrix factorization,” SIAM Journal on Optimization, 
vol. 20, no. 3, pp. 1364-1377, 2009. 

[59] Y. Xu and W. Yin, “A block coordinate descent method for regularized multiconvex optimization with 
applications to nonnegative tensor factorization and completion,” SIAM Journal on Imaging Sciences, 
vol. 6, no. 3, pp. 1758-1789, 2013. 

[60] Y. Xu, W. Yin, Z. Wen, and Y. Zhang, “An alternating direction algorithm for matrix completion with 
nonnegative factors,” Erontiers of Mathematics in China, vol. 7, no. 2, pp. 365-384, 2012. 

[61] J. Yang and Y. Zhang, “Alternating direction algorithms for ^i-problems in compressive sensing,” SIAM 
Journal on Scientific Computing, vol. 33, no. 1, pp. 250-278, 2011. 


24 



